Abstract

Connectionist models of memory storage have been studied for many years, and aim to provide insight into potential 
mechanisms of memory storage by the brain. A problem faced by these systems is that as the number of items to be stored 
increases across a finite set of neurons/synapses, the cumulative changes in synaptic weight eventually lead to a sudden 
and dramatic loss of the stored information (catastrophic interference, CI) as the previous changes in synaptic weight are 
effectively lost. This effect does not occur in the brain, where information loss is gradual. Various attempts have been made 
to overcome the effects of CI, but these generally use schemes that impose restrictions on the system or its inputs rather 
than allowing the system to intrinsically cope with increasing storage demands. We show here that catastrophic 
interference occurs as a result of interference among patterns that lead to catastrophic effects when the number of patterns 
stored exceeds a critical limit. However, when Gram-Schmidt orthogonalization is combined with the Hebb-Hopfield model, 
the model attains the ability to eliminate CI. This approach differs from previous orthogonalisation schemes used in 
connectionist networks which essentially reflect sparse coding of the input. Here CI is avoided in a network of a fixed size 
without setting limits on the rate or number of patterns encoded, and without separating encoding and retrieval, thus 
offering the advantage of allowing associations between incoming and stored patterns. PACS Nos.: 87.10.+e, 87.18.Bb,
87.18.Sn, 87.19.La
Introduction

Nervous systems have two basic requirements: they must be 
stable and thus able to generate reliable specific outputs, while at 
the same time they must be flexible to allow the output to change 
during development or as a result of experience. This is the 
"stability-plasticity dilemma" [1], and it is a concern to both 
neurobiologists who want to understand how nervous systems cope 
with constantly changing internal and external conditions, and
those working on artificial neural networks. While not exclusively 
related to it, this problem is often considered in relation to 
memory. The analysis of memory systems has been a major focus 
of neuroscience research, but there are still many unanswered
questions that need to be addressed at both the experimental and 
theoretical levels. In terms of the stability-plasticity problem, the 
question is how a system can store new input patterns across 
shared components without disturbing previously stored informa- 
tion in those components. 

One of the first considerations of this problem was highlighted 
by Bienenstock, Cooper and Munro [2], who suggested that long- 
term potentiation (LTP), a proposed mechanism for learning and 
memory [3], could suffer from an inherent instability (the BCM 
model). They suggested that in systems with a set threshold for 
plasticity the potentiation of a synapse by a particular input that 
exceeded the threshold could leave that synapse open to further 
potentiation when another, non-salient, input was presented (this
has also been referred to as the "ongoing plasticity" problem; see
[4]). Due to the initial potentiation of the synapse, non-salient or 
random inputs caused by a non-stationary environment could 
exceed the threshold for plasticity, resulting in the potential for 
run-away cycles of potentiation which would alter the synaptic 
changes associated with the original memory. This would 
effectively overwrite the original memory, and in biological 
systems if left unchecked, excessive activation could also lead to 
epileptogenic or excitotoxic damage and cell death [5]. The 
opposite effect could occur with long-term depression, where a 
synapse is weakened when the input falls below a depression 
threshold: in this case there could be a positive feedback loop that 
results in the successive depression of the synapse. 

While the exact relationship is not clear, a similar effect may 
occur in artificial neural networks. When the number of 
sequentially recorded/stored patterns exceeds a critical value 
there is a sudden and complete loss of previously stored inputs [6]. 
This example of retroactive interference is called catastrophic 
interference (CI) and is caused by the sharing of connections 
whose weights are changed by the presentation of specific inputs. 
As more patterns are stored the weights are changed and beyond a 
critical point new inputs erase the memory of previous inputs. If 
the memories happen to be overlapping, or correlated, which 
essentially means that several of their elements are similar (the 
mathematical meaning is explained in [7], [8]), then a particular 
synapse may get increasingly more potentiated (or depressed), thus
resembling the stability issues addressed in the BCM model. In 
human memory, although recently stored or retrieved memories 
are labile (e.g. [9], [10]), it is rare to find a complete disruption or 
loss of previously acquired information: a relatively small and 
gradual reduction ("graceful degradation") rather than a large 
catastrophic loss usually occurs (e.g. [11]; but see [12], [13], [14]).
That a catastrophic interference-like effect can be shown under
some conditions is of interest, as it suggests a basic limitation of
storage systems that use a finite (although large) number of
components, and further that the brain has presumably evolved a
way of avoiding this phenomenon, allowing new information to be 
stored without disrupting previously stored information (but see 
[15]). Understanding this capability of the brain and how it can be 
applied in artificial networks could be of interest to both the 
psychological/neurobiological and technological communities. 

Various strategies have been suggested to overcome the effects 
of CI. These include the separation of new inputs from those 
previously stored by using a cascade of synaptic states [16]; 
separate encoding and storage systems (e.g. hippocampal and 
neocortical networks; [17]); setting limits on the magnitude or rate
of learning [18]; the creation of new storage components through
neurogenesis [19]; anti-Hebbian plasticity [20]; reducing the 
overlap between different patterns by sparse coding or by limiting 
or "sharpening" the number of units used to encode an input, 
orthogonal recoding of inputs, or interleaving, i.e. refreshing previously
stored inputs with the new patterns to be learnt (see [12] for a
review by French, and also Guyon et al. for an orthogonalization-like
approach that involves the pseudoinverse of the state matrix).
Connectionist architectures use interleaving algorithms that 
require the network to repeatedly cycle through the patterns to 
be learned; after the entire set of patterns has been presented many 
times, the network is expected to converge on an appropriate set of 
weights for the complete set. The problem of CI has also been 
addressed by curbing the growth of synaptic efficacy by putting 
bounds on plasticity (see [4]). This is biologically realistic, as it 
reflects "soft-bound" plasticity, the difficulty of potentiating
synapses that are initially strong [21]. While these approaches 
can overcome effects in theoretical analyses, they all have 
limitations in terms of their implementation or their biological 
relevance [22], [23]. 

The potential parallels between the stability issues in biological 
and artificial systems inspire us to study the run-away cycle of 
potentiation using strategies employed to overcome CI. The BCM 
model suggested a form of self-organising or homeostatic plasticity 
that could preserve function within set limits while still offering the
possibility of directed plastic changes through a sliding plasticity 
threshold [24], [25]. This threshold would be increased after LTP 
(or decreased after long-term depression, LTD) to ensure that the 
potentiation (or depression) needed to encode relevant changes 
could occur, but further potentiation would not occur with non- 
salient or random ongoing inputs, only when the new input 
exceeded the new plasticity threshold [26], [24]. In this case the 
plasticity of the synapse would be dependent on the previous 
activity of the synapse, an example of metaplasticity [27]. 

The BCM model is an attractive and biologically plausible 
proposition for introducing bounds on synaptic plasticity that 
could help to overcome the stability-plasticity dilemma. However,
as with most attempts to relate cellular and synaptic effects to 
network function (e.g. memory), while there is evidence for a 
shifting plasticity threshold the extent to which a BCM-like effect is 
involved in human memory has not been established, and the 
model has not been considered in artificial systems in the context 
of catastrophic interference. We show that when Gram-Schmidt
orthogonalization is combined with the Hebb-Hopfield model, the
model automatically prevents a run-away potentiation cycle from
being set up, and thus attains the ability to eliminate CI.

The model we use is extremely simplified and uses the bare 
minimum core features of the neural system we wish to study, and 
its underlying conditions. Consequently it may appear to be far 
removed from biology. However, it is analytically tractable and is 
very widely used in theoretical analyses, and it has an inherent 
property of encoding synapse-like elements that should give the 
essential science behind the phenomena we are interested in. Also 
it should generalize to more realistic models, provided that certain
conditions are met (see Discussion). We believe that the insight
we obtain from it may represent real phenomena. Because of the 
mathematical nature of the model, it is open in that it can, in 
principle, be generalized indefinitely to include realistic features. 
At every stage of its generalization (or expansion) to include a new 
realistic feature, its mathematical tractability has to be ascertained, 
and in principle the numbers that come out of solving the 
improved model should be comparable to experimental measure- 
ments. 

Inherent Bounds on Post-Synaptic Response in 
Hopfield Model 

Outline of the model 

For mathematical convenience and in line with most connec- 
tionist modeling we will consider a fully connected network in 
which each neuron is connected to all other neurons, and
information is spread over the entire network and stored as
changes in synaptic efficacy that depend on the activities of the
pre- and the post-synaptic neurons. The same set of neurons and 
synapses are involved in storage as well as retrieval of information. 
A neuron is treated as a binary entity, which assumes values +1 
and −1 depending on whether it 'fires' or 'does not fire'. Any
information that comes to be recorded in the network is assumed
to trigger 'firing' and 'not firing' activities among the neurons in an
asynchronous manner: the neurons exchange signals (i.e. action
potentials) which raise or lower the potentials on post-synaptic
neurons, and if the net potential on a neuron exceeds its threshold
then it fires (+1), otherwise it remains quiescent (−1). Thus, an
item of information 'μ' is represented by a vector,

$$\xi^{(\mu)} = \{1, -1, -1, 1, \ldots\}, \tag{1}$$

whose components are a collection of +1 and −1 (appearing to be
distributed randomly) [28]. The information, represented by a
pattern of ±1's spread over the network, is stored in the synapses
according to the following learning rule, originally postulated by 
Cooper [29] to mimic Hebbian synaptic plasticity: 

$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^{p}\left(\xi_i^{(\mu)}\xi_j^{(\mu)} - \delta_{ij}\,\xi_i^{(\mu)}\xi_j^{(\mu)}\right). \tag{2}$$

$J_{ij}$ is the synaptic efficacy between a pair of neurons $i$ and $j$, $\xi_i^{(\mu)}$
is the $i$th component of vector $\xi^{(\mu)}$, $\delta_{ij}$ is the Kronecker delta function
(= 0 unless $i = j$, when it is 1), $N$ represents the number of neurons
in the network, and $p$ is the number of patterns recorded in the
network. The right hand side is divided by $N$ to normalise the
results so that they become independent of the size of the system,
i.e. the number of neurons in the network (note that the length of
$\xi^{(\mu)}$ is $|\xi^{(\mu)}| = (\sum_i (\xi_i^{(\mu)})^2)^{1/2} = N^{1/2}$, so by dividing the vector, or equivalently
each of its components, by $N^{1/2}$ the length of the vector is
normalised to one regardless of the size of the system). For
simplicity we consider $J_{ij} = J_{ji}$, though the model does not impose
this restriction, but $J_{ii} = 0$ is required for mathematical reasons
[30]. The $\delta_{ij}$ is introduced in the second term on the right hand
side to ensure that $J_{ii} = 0$. It is assumed that synaptic efficacy
between two neurons depends on the activities of the post- and the
pre-synaptic neurons, and following Hebb [31], since the efficacy
is expected to be high if both neurons fire and low when one of
them is not firing, $J_{ij}$ is taken as the product of $\xi_i$ and $\xi_j$.
This means that if, for example, the postsynaptic neuron fires
independently of the presynaptic neuron the synaptic efficacy will
be weakened, which has a correlate in spike timing-dependent
plasticity in biological systems (e.g. [32]). However, biologically
there is no correlate as to how the efficacy $J_{ij}$ can be increased if
both the neurons do not fire, as rule (2) would indicate. This rule is
referred to as Hebbian learning in spite of the above discrepancy.
In practice, the potentiation predicted when neither neuron fires is
often ignored by placing a bound on the synapse [33].
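
As a concrete illustration of rule (2), the following minimal Python sketch builds the weight matrix from a list of ±1 patterns. The function name `hebb_weights`, the list-of-lists representation and the two 4-neuron patterns are assumptions made purely for illustration; they are not taken from the original simulations.

```python
def hebb_weights(patterns):
    """Rule (2): J_ij = (1/N) * sum over patterns of xi_i * xi_j, with J_ii = 0."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # the delta_ij term removes self-connections
                    weights[i][j] += xi[i] * xi[j] / n
    return weights


# Two hypothetical 4-neuron patterns, chosen only for illustration.
patterns = [[1, -1, 1, -1], [1, 1, -1, -1]]
J = hebb_weights(patterns)
```

Each stored pattern adds its own increment to every synapse, so the matrix is symmetric by construction and the diagonal stays at zero.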

Note that the $i$-$j$ synapse changes every time a pattern comes to
be recorded and the change is added to the changes produced by
the previous patterns. Having stored a number of patterns, say $p$,
we should test if they are actually stored in the synapses following
the Hebbian prescription in (2). We can present one of the $p$ learnt
patterns to the network and check if it can associate with its
original version supposedly embedded in the memory store. The
presented pattern, say the $\nu$th, will create local fields on different sites
(or neurons) via the synaptic efficacies (or weights) modified in the
course of learning $p$ patterns as follows.



$$h_i^{(\nu)} = {\sum_{j}}' J_{ij}\,\xi_j^{(\nu)}. \tag{3}$$



Here $i$ is the post-synaptic neuron, and $j$ are the pre-synaptic
neurons with respect to $i$. The 'prime' on the summation indicates
that the sum is over all $j$'s except $i$ so that the inputs from all $j$ sites
add up on $i$ and self-connections $J_{ii}$ are excluded. The activity or
its absence on pre-synaptic neurons $j$, represented by $\xi_j = +1$ and
$-1$ respectively, individually influence the neuron $i$ with weights
$J_{ij}$, and these influences (which can be positive or negative since
the weights as well as $\xi_j$ can be positive as well as negative) add up
on the post-synaptic neuron $i$ to produce a net effect, the local
potential $h_i$. This local field (or potential), which is a measure of
total post-synaptic potential (PSP) on neuron $i$, can be positive or
negative. If its sign matches with the sign of $\xi_i^{(\nu)}$ and such
agreement happens on the majority of neurons (say, more than
97%, a generally accepted level; see [34] and references therein)
then the association is considered to be good and the pattern $\nu$ is
considered as recalled, or retrieved.
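
The retrieval test just described can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `local_fields` and `is_retrieved`, and the single 8-neuron pattern, are assumptions, while the 97% agreement level is the one quoted in the text.

```python
def hebb_weights(patterns):
    """Rule (2): J_ij = (1/N) * sum over patterns of xi_i * xi_j, with J_ii = 0."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += xi[i] * xi[j] / n
    return weights


def local_fields(weights, state):
    """Eqn (3): h_i = sum over j != i of J_ij * state_j (J_ii is already zero)."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def is_retrieved(weights, pattern, threshold=0.97):
    """True when sign(h_i) agrees with pattern_i on more than `threshold` of the sites."""
    fields = local_fields(weights, pattern)
    agreeing = sum(1 for h, x in zip(fields, pattern) if h * x > 0)
    return agreeing / len(pattern) > threshold


# A single hypothetical 8-neuron pattern: with p = 1 there is no cross-talk.
pattern = [1, -1, 1, 1, -1, -1, 1, -1]
J = hebb_weights([pattern])
fields = local_fields(J, pattern)
```

With only one stored pattern every local field equals $(N-1)/N$ times the corresponding component, so the pattern is trivially retrieved; interference only appears once several patterns share the same synapses.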

To elaborate, we substitute for $J_{ij}$ from eqn.(2). So,

$$h_i^{(\nu)} = \frac{1}{N}\sum_{j=1}^{N}\sum_{\mu=1}^{p}\left(\xi_i^{(\mu)}\xi_j^{(\mu)} - \delta_{ij}\,\xi_i^{(\mu)}\xi_j^{(\mu)}\right)\xi_j^{(\nu)} = \frac{1}{N}\sum_{\mu=1}^{p}\left(\xi^{(\mu)}\cdot\xi^{(\nu)}\right)\xi_i^{(\mu)} - \frac{1}{N}\sum_{\mu=1}^{p}\xi_i^{(\mu)}\xi_i^{(\mu)}\xi_i^{(\nu)}, \tag{4}$$

since $\sum_j \xi_j^{(\mu)}\xi_j^{(\nu)} = \xi^{(\mu)}\cdot\xi^{(\nu)}$, the dot-product of two vectors, and
$\delta_{ij}$ picks out $\xi_i^{(\nu)}$ from $\sum_{j=1}^{N}$ and makes the remaining terms
zero; $\delta_{ij}$ also serves the purpose of 'the prime' on $\sum_j$, so 'the
prime' is dropped in eqn.(4). Isolating the $\mu = \nu$ component from
$\sum_{\mu=1}^{p}$ in the first term on the right hand side, we will get $N$ from
$\xi^{(\nu)}\cdot\xi^{(\nu)}$ and will be left with $\sum_{\mu\neq\nu}$. Further, $\frac{1}{N}\sum_{\mu=1}^{p}\xi_i^{(\mu)}\xi_i^{(\mu)}$ gives
$p/N$ in either case of $\xi_i^{(\mu)}$ being $+1$ or $-1$. Thus, we find that

$$h_i^{(\nu)} = \left(1 - \frac{p}{N}\right)\xi_i^{(\nu)} + \frac{1}{N}\sum_{\mu\neq\nu}\left(\xi^{(\mu)}\cdot\xi^{(\nu)}\right)\xi_i^{(\mu)}. \tag{5}$$

This rearrangement has enabled us to isolate $\xi_i^{(\nu)}$, whose sign is
to be compared with that of $h_i^{(\nu)}$, from a jumble of cross terms
involving the test pattern 'ν' and all the other patterns in the
memory store represented by 'μ'. This is like separating a signal
from a jumbled mixture of cross-talks this signal has with a number
of other signals. If the $\xi^{(\mu)}$'s happen to be mutually orthogonal, the
cross-talks will vanish and the memories would work perfectly
[30].

Analysis of post-synaptic potential

The sign of $h_i^{(\nu)}$ (or PSP) can become unfavourable (i.e. opposite
of $\xi_i^{(\nu)}$) due to the second term in eqn.(5) (let us call it $A$). Since the
vectors $\xi^{(\mu)}$ consist of randomly generated +1's and −1's, each of
the $p - 1$ terms in the second term in the right hand side of eqn (5) will
take a fractional value, less than 1, with a random sign (+ or −).
Thus, for $\xi_i^{(\nu)} = +1$, $A$ can take any positive or negative value
limited by the values of $p$ and $N$, but as long as it is greater than
$-(1 - p/N)$, $h_i^{(\nu)}$ will match in sign with $\xi_i^{(\nu)}$. Similarly, for $\xi_i^{(\nu)} = -1$,
$h_i^{(\nu)}$ will match in sign with $\xi_i^{(\nu)}$ if $A$ remains less than $(1 - p/N)$.
Figure 1 shows the favourable ranges of values of $A$ in the form of
shaded areas. Note that in general the $\xi^{(\mu)}$'s are not orthogonal to
$\xi^{(\nu)}$. So, the dot products $\xi^{(\mu)}\cdot\xi^{(\nu)}$ are non-zero. In spite of the signs
being randomly + or − the chances of $A$ growing arbitrarily large,
+ve or -ve, become increasingly large with increasing $p$. This
increases the possibility of CI as explained below. 
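
The split of the local field into a signal term and a cross-talk (noise) term can be checked numerically. The sketch below is illustrative only: the function names, the fixed random seed and the pattern sizes are assumptions. It computes the field directly from the Hebbian weights and again as the sum of the signal $(1 - p/N)\,\xi_i^{(\nu)}$ and the noise $A$ of eqn.(5), and it includes a set of mutually orthogonal patterns for which the noise vanishes.

```python
import random


def hebb_weights(patterns):
    """Rule (2): J_ij = (1/N) * sum over patterns of xi_i * xi_j, with J_ii = 0."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += xi[i] * xi[j] / n
    return weights


def field_direct(patterns, nu):
    """Local field on every site when stored pattern number nu is presented."""
    weights = hebb_weights(patterns)
    return [sum(w * s for w, s in zip(row, patterns[nu])) for row in weights]


def field_split(patterns, nu):
    """Return (signal, noise) of eqn (5) for pattern number nu."""
    n, p = len(patterns[0]), len(patterns)
    target = patterns[nu]
    signal = [(1 - p / n) * x for x in target]
    noise = [0.0] * n
    for mu, xi in enumerate(patterns):
        if mu == nu:
            continue
        overlap = sum(a * b for a, b in zip(xi, target))  # dot product xi_mu . xi_nu
        for i in range(n):
            noise[i] += overlap * xi[i] / n
    return signal, noise


# Mutually orthogonal +/-1 patterns (three rows of an 8 x 8 Hadamard matrix).
orthogonal = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]

# Random patterns with a fixed seed (an arbitrary choice for reproducibility).
rng = random.Random(0)
random_patterns = [[rng.choice((-1, 1)) for _ in range(20)] for _ in range(4)]
```

For the orthogonal set every overlap is zero, so the field reduces to the signal alone; for the random set the noise term is generally non-zero and is what grows as more patterns are added.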

In eqn. (5) the first term on the right hand side is like signal while
$A$ represents noise. Note that the first term is obtained by isolating
in eqn (4) the relevant component, i.e. the $i$th, of the pattern being
retrieved, i.e. the $\nu$th vector, while the overlaps of $\xi^{(\nu)}$ with all the
remaining vectors in the memory store are clubbed together in the
second term; it is these non-zero overlaps that obfuscate the signal
and hence act as noise. From the above we see that as long as the
noise $A$ can be bounded by $(p/N - 1)$ from below and by $(1 - p/N)$
from above, $h_i^{(\nu)}$ will be confined between $(p/N - 1)$ and $(1 - p/N)$,
and CI will be contained. However, as new patterns come to be
recorded, there is no intrinsic mechanism in the Hopfield model to
control their overlaps with the patterns already in the store and
thereby restrict the noise $A$ to within the above limits, and thus
restrict $h_i^{(\nu)}$ to within the above favourable limits. Thus, as the
number of patterns in the store increases the noise builds up and
the likelihood of $h_i^{(\nu)}$ remaining within favourable limits reduces on
more of the neurons ($i$'s) in the system and CI becomes
inescapable. These bounds on PSP can slide with the variations
in $p$ and $N$, making the system more or less susceptible to CI. If $p$
increases (for a given $N$) then the bounds shrink and the system
becomes more susceptible to CI, which is understandable since the
interference among patterns will increase as their number
increases. On the other hand the increasing system size (such that
$p/N \to 0$) would widen the gap between the bounds and reduce the
chances of CI.

Figure 1. Schematic representation of $h_i^{(\nu)}$, the post-synaptic
potential on an arbitrary site $i$ when one of the learnt patterns,
ν, is presented to check for retrieval, versus $A$, the noise term in
eqn.4. The shaded areas represent the domains where $h_i^{(\nu)}\xi_i^{(\nu)}$ will be
positive definite. The bounds on $A$ slide up and down with variations
in $p$ and $N$ enabling, at least in principle, plasticity to control CI to some
extent.

doi:10.1371/journal.pone.0105619.g001

Note that outside the above bounds $A$ can, in principle, grow to
very large positive or negative values, akin to runaway effects in
the BCM model (see above). Although indefinitely large positive
and negative values of $A$ will keep $h_i^{(\nu)}\xi_i^{(\nu)} > 0$ for $\xi_i^{(\nu)} = +1$ and
$\xi_i^{(\nu)} = -1$ respectively, the fact is that $A$ takes positive or negative
values in a seemingly uncontrolled and random manner.
Therefore, its growth to large values is, in general, detrimental
to retrieval (or recall) and leads to CI [34]. This will cause the run-
away effect, which will eventually give false (or deceptive)
associations with the feature designated by site $i$.

The uncontrolled growth of $h_i^{(\nu)}$ on a large number of sites
inevitably leads to catastrophic forgetting in the Hopfield model if
the ratio $p/N$ exceeds 0.14 (see e.g. [30]). In figure 2 we present
the result of a simulation showing how degradation sets in in the
quality of retrieval as $p/N$ exceeds 0.14 (details are given in the
following section).

A Way Out of Catastrophic Interference 

It is our hypothesis that when a stimulus (or vector) $\xi$ is
presented to the system, the system orthogonalizes it with respect
to all the vectors in the memory store and then stores the
orthogonalized vector $\eta$ rather than the raw vector $\xi$ [7]. In real
terms this amounts to storing the similarities and differences of the
new vector with the old vectors.

Suppose $\eta^{(1)}, \eta^{(2)}, \ldots, \eta^{(p)}$ are the orthogonalized versions of
$\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(p)}$ and they are stored in the Hebbian manner as,

$$J_{ij} = \sum_{\mu=1}^{p}\left(\hat\eta_i^{(\mu)}\hat\eta_j^{(\mu)} - \delta_{ij}\,\hat\eta_i^{(\mu)}\hat\eta_j^{(\mu)}\right), \tag{6}$$

where $\hat\eta_i^{(\mu)}$ are the components of $\hat\eta^{(\mu)}$ obtained by normalising
$\eta^{(\mu)}$ as $\eta^{(\mu)}/|\eta^{(\mu)}|$. It is not immediately obvious as to how the brain
would perform the normalization. While there is physiological and
behavioural (e.g. psychophysical) evidence for normalization as a
canonical neural computation, its role and underlying mechanisms
are still an area of intense research [35].

Now a new vector, $\xi^{(p+1)}$, comes to be recorded. Some neurons
fire and some don't, accordingly they get values +1 and −1, and
through the above $J_{ij}$'s, local fields, or PSP's, develop on each
neuronal site as,

$$h_i^{(p+1)} = \sum_{j=1}^{N} J_{ij}\,\xi_j^{(p+1)}, \quad \text{for } i = 1, 2, \ldots, N. \tag{7}$$

As explained above the $h_i^{(p+1)}$'s may or may not match with
$\xi_i^{(p+1)}$'s for all values of $i$, but, in any case, the system would know
the difference $(\xi_i^{(p+1)} - h_i^{(p+1)})$ on each neural site. Note that the
computation of this difference on each site already amounts to
orthogonalization [7], i.e.

$$\eta_i^{(p+1)} = \xi_i^{(p+1)} - h_i^{(p+1)}, \tag{8}$$

where,

$$h_i^{(p+1)} = \sum_{\mu=1}^{p}\left(\hat\eta^{(\mu)}\cdot\xi^{(p+1)}\right)\hat\eta_i^{(\mu)} - O\!\left(\frac{p}{N}\right)\xi_i^{(p+1)}, \tag{9}$$

since $(\hat\eta_i^{(\mu)})^2$ is of the order of $1/N$.

The interesting new thing we point out here is that if it so
happens that $\xi^{(p+1)}$ is already in the memory store, say as the $\nu$th
vector ($1 \le \nu \le p$), then $\xi^{(\nu)}$ will not project on to $\hat\eta^{(\nu+1)}, \ldots, \hat\eta^{(p)}$ [36],
and the first $(\nu - 1)$ terms in eqn.(9) will give $(\xi_i^{(\nu)} - \eta_i^{(\nu)})$. Then,

$$h_i^{(p+1)} = \xi_i^{(\nu)} - \eta_i^{(\nu)} + \left(\hat\eta^{(\nu)}\cdot\xi^{(\nu)}\right)\hat\eta_i^{(\nu)} - O\!\left(\frac{p}{N}\right)\xi_i^{(\nu)} = \left(1 - O\!\left(\frac{p}{N}\right)\right)\xi_i^{(\nu)}, \tag{10}$$

since $\hat\eta^{(\nu)}\cdot\xi^{(\nu)} = \hat\eta^{(\nu)}\cdot\eta^{(\nu)}$. So the presented $\xi^{(p+1)}$ will be identified
as $\xi^{(\nu)}$, with $\eta^{(p+1)}$ on the order of zero. This would imply that
$\xi^{(p+1)}$ will not be orthogonalized and stored again, no matter how
often it is presented. However, if it turns out that $\xi^{(p+1)}$ is indeed a
new vector, which is not there in the memory store, then $\eta^{(p+1)}$
will be computed according to eqn.(8) and will be stored in the
synapses following the modified Hebb's learning rule (6). Some
clarification is needed here in order to understand how the Hebb-
Hopfield model with Gram-Schmidt orthogonalization (H-H-G-S)
scores over the conventional Hebb-Hopfield (H-H) model.
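
The way a familiar vector is recognised and not stored again can be illustrated with a small Python sketch. Note the assumptions: the sketch keeps an explicit list of orthonormal vectors and uses the exact Gram-Schmidt projection, whereas the model in the text computes the difference through the synaptic matrix with its diagonal removed; the names `residual` and `present`, the tolerance `tol` and the two 4-component vectors are hypothetical choices for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residual(basis, xi):
    """Part of xi that is orthogonal to every unit vector in `basis`."""
    eta = list(xi)
    for e in basis:
        c = dot(e, eta)  # overlap with one stored, normalised vector
        eta = [x - c * y for x, y in zip(eta, e)]
    return eta


def present(basis, xi, tol=1e-9):
    """Store the normalised residual of xi if it is new; return True if stored."""
    eta = residual(basis, xi)
    norm = math.sqrt(dot(eta, eta))
    if norm < tol:
        return False  # residual is (numerically) zero: the vector is already familiar
    basis.append([x / norm for x in eta])
    return True


basis = []
a = [1, 1, 1, 1]
b = [1, 1, 1, -1]
stored_flags = [present(basis, v) for v in (a, b, a, b)]
```

In this toy run the first presentations of `a` and `b` are stored, while their repeated presentations leave the store unchanged, which is the behaviour eqn.(10) describes for an already familiar vector.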

Let $p$ normalized vectors be stored (for $p/N$ very small, say 0.05)
in each of the above two cases, and let a test vector that is similar
to (but not exactly the same as) one of the $p$ stored vectors be
presented to check if it associates with any of the $p$ stored vectors.
In both the cases the test vector will indeed associate with the one of
those $p$ vectors which it resembles. This means that in the H-H-G-S
scheme the $p$ imprinted vectors are stable in the same way as in the
H-H scheme, i.e. they have non-zero basins of attraction [30,37],
and that the test vector, which falls within the basin of attraction of
one of the imprinted vectors, converges to the imprinted vector.
Thus the attractor neural network (ANN) character typically
attributed to H-H is preserved in H-H-G-S.

To elaborate further we note that two processes are involved in
this: (i) 'storage' of information (or vectors) in the synapses through
eqns. (2) and (6) respectively in the two cases; and (ii) 'association'
of a presented test vector with one of the memorised vectors
through prescriptions (3) and (7) respectively. The two processes
are invoked independently in H-H in that when a new vector is
presented we have to specify whether the process of 'storage' needs
to be invoked or whether the vector is meant to be 'associated'
with a vector in the memory. If it is instructed to be stored then it
will be stored regardless of the extent of its similarity or difference
with any of the vectors already in the memory. But in H-H-G-S
the two processes are linked.

Figure 2. Simulation results for a system of 1000 neurons. (A) Hopfield network showing memory breakdown due to catastrophic
interference amongst the stored patterns - the fraction of input patterns that is retrieved drops rapidly around the load parameter, p/N = 0.14. The
results are shown for three sets of patterns and the inset shows the results averaged over 50 sets of patterns. (B) Hopfield network with Gram-
Schmidt orthogonalization of the incoming patterns. All the learnt patterns are retrieved perfectly until p = N, when the retrieval fraction drops to
zero abruptly. The inset shows magnification very close to the load parameter = 1 to highlight the abruptness of the drop. Note that the system does
not learn the raw patterns as they are presented but their orthogonalized versions, whereas the retrieval is checked for the raw patterns.
doi:10.1371/journal.pone.0105619.g002

When a new vector is presented to the H-H-G-S scheme for
storage, it has to be first orthogonalized, and as part of
orthogonalization it is first subjected to a check, through eqn.(7),
whether it 'associates' with any of the stored vectors, and if so, with
which one. If it falls within the basin of attraction of one of the
stored vectors [30] then it will be associated with that particular
vector in the memory store and signs of $\{h_i^{(p+1)}\}$ will coincide with
those of the components of that vector. In case the new vector is
not similar to any of the stored vectors then $h^{(p+1)}$ will be an
independent vector that holds the information of the overlaps of
the new presented vector with all the stored vectors in a
convoluted manner.

The above amounts to half of the orthogonalization process.
The process is completed with the comparison (through eqn.(8)) of
the new presented vector with $h^{(p+1)}$, which may correspond
either to one of the stored vectors or to a vector very different from
any one of them. The difference calculated by eqn.(8) will be small
or large depending on the two situations, but in either case this is
tantamount to orthogonalization and the orthogonalized version
of the new vector will be 'stored' according to eqn.(6). In case the
presented new vector happens to be identical (not just similar) to a
vector already in the memory store then, as shown in eqn.(10),
$\eta^{(p+1)}$ will be identically zero.

The H-H-G-S scheme thus appears to be close to reality in
which, when the brain encounters new information, before
storing it, it knows, in the background of the information already 
in its memory, that the new information is completely familiar, or 
completely unfamiliar, or partially familiar. This is accomplished 
by the first part of orthogonalization represented by eqn.(7), 
namely 'association'. 



The crucial implication in the present context of CI is that
orthogonalization diminishes the overlap of any pattern that
comes to be recorded with every one of those that are already in
the store and thus suppresses the noise $A$. The PSPs, $h_i^{(\nu)}$'s on all
the sites $i$, are pinned at $(1 - O(p/N))\,\xi_i^{(\nu)}$. Since $\xi_i^{(\nu)} = \pm 1$, the PSP's
are strictly confined within the range $((O(p/N) - 1), (1 - O(p/N)))$.
Thus, already familiar stimuli are blocked from stimulating the
system again and again to cause overloading and a possible run-
away potentiation.
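
The pinning of the PSPs can be checked with a short Python sketch. It is a sketch under stated assumptions: patterns are orthogonalized by the classical Gram-Schmidt projection, the weights follow rule (6) as read here (a sum over normalised vectors with the diagonal removed), and the function names, the seed and the pattern sizes are illustrative. Under these assumptions the field produced by a stored raw pattern equals $(1 - P_{ii})\,\xi_i$, where $P_{ii} = \sum_\mu (\hat\eta_i^{(\mu)})^2$ plays the role of the $O(p/N)$ term.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalise(patterns, tol=1e-9):
    """Gram-Schmidt: unit vectors spanning the same space as the raw patterns."""
    basis = []
    for xi in patterns:
        eta = list(xi)
        for e in basis:
            c = dot(e, eta)
            eta = [x - c * y for x, y in zip(eta, e)]
        norm = math.sqrt(dot(eta, eta))
        if norm > tol:  # skip patterns that add nothing new
            basis.append([x / norm for x in eta])
    return basis


def gs_weights(basis):
    """Rule (6) as read here: J_ij = sum over mu of e_i * e_j, with J_ii = 0."""
    n = len(basis[0])
    weights = [[0.0] * n for _ in range(n)]
    for e in basis:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += e[i] * e[j]
    return weights


def local_fields(weights, state):
    return [sum(w * s for w, s in zip(row, state)) for row in weights]


def pinned_fields(basis, pattern):
    """Predicted PSPs (1 - P_ii) * xi_i, with P_ii = sum over mu of e_i squared."""
    return [(1 - sum(e[i] ** 2 for e in basis)) * x for i, x in enumerate(pattern)]


rng = random.Random(1)  # arbitrary seed, for reproducibility only
patterns = [[rng.choice((-1, 1)) for _ in range(24)] for _ in range(8)]
basis = orthonormalise(patterns)
J = gs_weights(basis)
```

Because $P_{ii}$ lies between 0 and 1, the field on each site keeps the sign of the raw pattern as long as $P_{ii} < 1$, which is why the raw patterns remain retrievable although only their orthogonalized versions are stored.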

In Figure 2 we present results of our simulations showing (a) 
how the retrieval quality drops rapidly around p/N = 0.14 
signifying CI, and (b) how Gram-Schmidt orthogonalization 
overcomes catastrophic interference. We use a system comprising 
1000 neurons. Patterns are generated using pseudo-random 
number generators to assign values + 1 and — 1 to the neurons. 
The patterns are learnt sequentially and stored by changing the 
synaptic efficacy $J_{ij}$ and accumulating the changes as in eqn.(2).
Soon after a pattern is stored, it is presented back to the network to 
check if it can be retrieved using the prescription elaborated in 
eqns.(3-5). Figure 2(A) shows the fraction of retrieval, i.e. the ratio, 
(no. of retrieved patterns) /(no. of learnt patterns), versus load 
parameter, which is the ratio of (no. of learnt patterns)/(total 
number of neurons), i.e. $p/N$. Around $p/N = 0.14$ the fraction of
retrieved patterns dips below 90% quite rapidly and reduces to
almost zero around $p/N = 0.17$. The results are shown for three
sets of input patterns. The inset shows the same plot after averaging
over 18 sets of patterns. Figure 2(B) shows the same calculation
after invoking Gram-Schmidt orthogonalization on the incoming 
patterns - an incoming pattern is first orthogonalized with respect 
to all the stored patterns (using eqn.(8)) and then stored, but the 
original, or the raw pattern (before orthogonalization) is tested for 
retrieval. In a system of 1000 neurons all presented patterns are 
retrieved perfectly until $p = 998$. For $p = 999$ the fraction of
retrieved patterns dips abruptly to almost zero, and to exactly zero
when p = 1000 as amplified in the inset. 
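
A much smaller version of this comparison can be sketched in Python. It is not the simulation reported above: the network has only 40 neurons, all patterns are stored before any is tested, retrieval is judged from a single evaluation of the local fields with the 97% criterion, and the function names and the seed are assumptions. With so few neurons the sharp transition near $p/N = 0.14$ should not be expected; the sketch only contrasts the two storage rules at a given load.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def outer_sum(vectors, scale):
    """J_ij = scale * sum over vectors of v_i * v_j, with the diagonal set to zero."""
    n = len(vectors[0])
    weights = [[0.0] * n for _ in range(n)]
    for v in vectors:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += scale * v[i] * v[j]
    return weights


def orthonormalise(patterns, tol=1e-9):
    basis = []
    for xi in patterns:
        eta = list(xi)
        for e in basis:
            c = dot(e, eta)
            eta = [x - c * y for x, y in zip(eta, e)]
        norm = math.sqrt(dot(eta, eta))
        if norm > tol:
            basis.append([x / norm for x in eta])
    return basis


def retrieval_fraction(weights, patterns, threshold=0.97):
    """Fraction of raw patterns whose field signs agree on more than `threshold` of sites."""
    retrieved = 0
    for xi in patterns:
        fields = [sum(w * s for w, s in zip(row, xi)) for row in weights]
        agreeing = sum(1 for h, x in zip(fields, xi) if h * x > 0)
        if agreeing / len(xi) > threshold:
            retrieved += 1
    return retrieved / len(patterns)


def compare(n, p, seed=0):
    """Return (plain Hebbian fraction, Gram-Schmidt fraction) for p random patterns."""
    rng = random.Random(seed)
    patterns = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
    plain = retrieval_fraction(outer_sum(patterns, 1.0 / n), patterns)
    gs = retrieval_fraction(outer_sum(orthonormalise(patterns), 1.0), patterns)
    return plain, gs


for p in (2, 10, 20):
    print(p / 40, compare(40, p))
```

The printed pairs give the retrieval fraction without and with orthogonalization at each load; the raw patterns are the ones tested in both cases, as in Figure 2(B).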

Even though by storing orthogonalized patterns the memory 
capacity appears to rise from $0.14N$ to almost $N$, it is important
that we check the stability of the stored memories. As stated above 
we should do it by computing the basins of attraction for the 
memories. Using the standard definitions [30,37] we did the 
simulations for a smaller network of 100 neurons to get an idea as 
to how the size of basin of attraction changes when we introduce
orthogonalization. 

To get the right perspective we first did the calculations for the
conventional Hopfield model. The network was made to learn 12
randomly generated 100-dimensional patterns (of +1 and −1)
according to eqn.(2). The patterns were then picked up one by one
and states of certain neurons were switched (from −1 to +1 or vice
versa), starting with switching of state of one neuron chosen
randomly, and it was checked if the chosen imprinted pattern, say
the $\nu$th, could be retrieved following the prescription of eqn.(5). If the
signs of $\{h_i\}$ did not match with those of the imprinted $\{\xi_i^{(\nu)}\}$ then
$\{h_i\}$ were fed to the right hand side of eqn.(5) as $\{\xi_i\}$ and new
$\{h_i\}$ were calculated and their signs were compared with those of
the imprinted $\{\xi_i^{(\nu)}\}$. A maximum of 10 such iterations were tried
to check if they led to convergence to the imprinted $\{\xi_i^{(\nu)}\}$. This
exercise was repeated for 10 samples generated by picking the
'flipped' neuron from 10 different locations chosen randomly in
the array of 100 neurons.

The above procedure was repeated by switching signs of more 
and more neurons successively until the overlap of the retrieved 
pattern with the corresponding imprinted pattern fell below 100%. 
This marked the size of basin of attraction for a particular 
imprinted pattern. 
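
The procedure can be sketched as follows. The sketch makes several assumptions that are not stated in the text: all neurons are updated synchronously, the sign of each local field (rather than the field itself) is fed back, a neuron keeps its state when its field is exactly zero, and the basin size is taken as the largest number of flipped neurons for which every one of the 10 random samples converges within 10 iterations. The function names and the seed are illustrative.

```python
import random


def hebb_weights(patterns):
    """Rule (2): J_ij = (1/N) * sum over patterns of xi_i * xi_j, with J_ii = 0."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for xi in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += xi[i] * xi[j] / n
    return weights


def step(weights, state):
    """One synchronous update: each neuron takes the sign of its local field."""
    new_state = []
    for row, s in zip(weights, state):
        h = sum(w * x for w, x in zip(row, state))
        new_state.append(1 if h > 0 else -1 if h < 0 else s)
    return new_state


def recovers(weights, pattern, flipped, max_iter=10):
    """Flip the given sites and iterate; True if the imprinted pattern is reached."""
    state = [-x if i in flipped else x for i, x in enumerate(pattern)]
    for _ in range(max_iter):
        state = step(weights, state)
        if state == pattern:
            return True
    return False


def basin_size(weights, pattern, trials=10, max_iter=10, seed=0):
    """Largest number of flipped neurons from which all random trials recover."""
    rng = random.Random(seed)
    n = len(pattern)
    for k in range(1, n + 1):
        for _ in range(trials):
            flipped = set(rng.sample(range(n), k))
            if not recovers(weights, pattern, flipped, max_iter):
                return k - 1
    return n


# One hypothetical 20-neuron pattern stored on its own.
pattern = [1 if i % 3 else -1 for i in range(20)]
J = hebb_weights([pattern])
size = basin_size(J, pattern)
```

For a single stored pattern the basin extends to just under half of the neurons; with several stored patterns the same function can be called for each imprinted pattern to obtain a distribution of basin sizes.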

For the conventional Hopfield model the basins of attraction for
12 imprinted patterns were distributed in a broad range from 26 to
44, with maximum probability for basins of sizes 34 to 37. As the 
number of imprinted patterns increased beyond 10 certain 
patterns began to show absence of basin of attraction (i.e. basin 
of size zero). Beyond 14 memorised patterns the number of 
patterns with zero basin of attraction increased rapidly. 

Orthogonalization improves the situation considerably. We 
considered the same 12 patterns but stored their orthogonalized 
versions. The original patterns (before orthogonalization) were 
considered for retrieval and basins of attraction were computed for 
them. The sizes of the basins ranged between 6 and 45 but were 
concentrated around 31. From p = 14 certain patterns begin to 
lose their basin of attraction (i.e. a basin of attraction of size zero), though 
with very small probability, about 0.0093. The probability 
increases quite rapidly with p, becoming 0.49 at p = 2'i and 1.0 
when p touches 100. Thus orthogonalization presents an 
interesting scenario in which, in a system of N neurons, up to 
(N - 1) patterns are stored and retrieved efficiently, and therefore 
compete for space for their basins of attraction. There are several 
interesting issues that need close investigation. We are in the 
process of carrying them out. 
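
A minimal sketch of storing the orthogonalized versions of the patterns and then presenting the original patterns for retrieval is given below. It is written under stated assumptions rather than as the implementation used here: the couplings are taken to be J_ij = sum over stored vectors of eta_i eta_j / |eta|^2, where the eta are the Gram-Schmidt vectors, and self-couplings are kept. With this normalisation J is the projector onto the space spanned by the stored patterns, so the local field produced by an original pattern is that pattern itself. All function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(patterns, tol=1e-9):
    """Return mutually orthogonal vectors spanning the same space as patterns."""
    basis = []
    for xi in patterns:
        eta = [float(x) for x in xi]
        for b in basis:
            coeff = dot(b, eta) / dot(b, b)
            eta = [e - coeff * x for e, x in zip(eta, b)]  # remove overlap with b
        if dot(eta, eta) > tol:  # skip patterns already contained in the store
            basis.append(eta)
    return basis


def orthogonal_weights(patterns):
    """J_ij = sum_mu eta_i * eta_j / |eta|^2 (this normalisation is an assumption)."""
    n = len(patterns[0])
    J = [[0.0] * n for _ in range(n)]
    for eta in gram_schmidt(patterns):
        norm2 = dot(eta, eta)
        for i in range(n):
            c, row = eta[i] / norm2, J[i]
            for j in range(n):
                row[j] += c * eta[j]
    return J


def is_fixed_point(J, xi):
    """True if the sign of every local field agrees with the original pattern."""
    return all((dot(row, xi) >= 0) == (x > 0) for row, x in zip(J, xi))
```

Under this assumption every original pattern remains a fixed point for any number of stored patterns. When p reaches N the projector becomes the identity matrix, so every state is a fixed point and no pattern can have a basin of non-zero size, which is in line with the probability of 1.0 quoted above for p = 100.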

Discussion 

Many approaches have been used to try and overcome the 
problems of the actual or predicted loss of stored information in 
memory systems, both in connectionist networks (catastrophic 
interference) and in biological systems (e.g. ongoing plasticity, [4]; 
the stability-plasticity problem, [1]). A system has to be flexible 
enough to allow salient changes to be encoded continuously while 
at the same time being stable enough to ensure that stored changes 
persist. The approach that we show here uses a conventional 
Hopfield network. It thus makes no claims to be biologically 
realistic in the sense that it includes details of neuronal or synaptic 
physiology, but we feel that this simple case allows us to address 
fundamental issues of the stability-plasticity dilemma. The 
approach that we use allows the same components to encode 
and store information. In fact, rather than try and separate stored 
and new inputs, the input is instead considered in the context of 
previously stored inputs, which means that only the similarities 
and differences of new inputs are encoded while still allowing the 
full memory of the input to be recalled. 

We are able to show the capability of encoding and storing a 
significantly larger number of sequential inputs than is possible 
using conventional approaches and, importantly, allowing new 
inputs to be compared and generalized to those already in the 
store. This contrasts with the non-overlapping approaches used in 
connectionist networks in attempts to overcome catastrophic 
interference (e.g. [38]; see [12]). While separation of input patterns 
would remove catastrophic interference, it also removes the 
possibility of generalising and linking together aspects of the stored 
patterns. This could be a particular problem for learning 
categories [17]. That a pattern to be stored is compared to those 
already in the store, without having to impose limits on the rate or 
extent of the synaptic changes, is a principal advantage of the 
orthogonalization approach that we show here. 

In human memory systems the subject learns on the 
background of previously stored information rather than isolating 
the new information from it, or overwriting the previously stored 
information (see [39]). This feature is an intrinsic component that 
arises from Gram-Schmidt orthogonalisation rather than having to 
be imposed from outside. This could allow artificial, and in 
principle biological systems, to make use of an intrinsic principle of 
physical systems, ensuring that a system that includes this 
automatically has this advantage built in. An orthogonalization-based 
neural system acts in a self-organized manner: it compares 
the new input with the old, isolates the similarities and differences between 
them, deduces whether the new input is unknown or 
known, and if the input is already known it refuses to 
entertain it a second time. In this way it acts as a form of "internal 
supervisor" [4], determining which synapses have to change to 
store the new memory while not destroying the changes at 
synapses that have previously stored information. A stimulus may 
be presented any number of times, but if the input has already been 
stored then the postsynaptic local fields will not change and 
therefore will not build up incessantly in the same direction to 
cause a possible run-away effect, akin to that suggested by the 
BCM model. 
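
The behaviour described above, in which an already stored input leaves the synapses untouched, can be illustrated with a small incremental sketch. It rests on the assumption that the couplings J are the sum of eta_i eta_j / |eta|^2 over the stored orthogonalized vectors, in which case the part of a new input that is not yet in the store is simply the input minus its own local field. The class name `OrthogonalMemory` and its methods are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class OrthogonalMemory:
    """Sequential storage in which each input is orthogonalized against the store."""

    def __init__(self, n, tol=1e-9):
        self.n = n
        self.tol = tol
        self.J = [[0.0] * n for _ in range(n)]

    def local_field(self, state):
        return [dot(row, state) for row in self.J]

    def learn(self, xi):
        """Store xi. Return True if the weights changed, False if xi was known."""
        h = self.local_field(xi)
        # Component of xi that is orthogonal to everything stored so far.
        eta = [x - f for x, f in zip(xi, h)]
        norm2 = dot(eta, eta)
        if norm2 < self.tol:
            return False  # nothing new: the weights are left untouched
        for i in range(self.n):
            c = eta[i] / norm2
            for j in range(self.n):
                self.J[i][j] += c * eta[j]
        return True
```

Presenting the same pattern repeatedly to `learn` changes the weights only on the first presentation, so under these assumptions the local fields cannot grow without bound through repetition.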

Orthogonalisation has been used previously in attempts to 
overcome the problems of catastrophic interference in connectionist 
networks (see, for example, [40]). However, the use of the 
term orthogonalisation in this context differs from the way that we 
have used it, where information is represented by a vector and 
orthogonalization makes the vector of new information 
perpendicular to the vectors representing the stored information. 
Orthogonalized, or mutually perpendicular, vectors do not overlap 
with each other. This orthogonalization scheme must be 
distinguished from the 'orthogonalization' approach that is 
typically used in the learning and memory literature (e.g. [41], 
[22], [6] and references therein). The latter generally refers to 
sparse coding of information in the network, i.e., two different 
pieces of information are stored on two non-overlapping sets of 
nodes in the network, thus removing the interference effect 
associated with CI. However, in the scheme presented here the 
same nodes are used. If patterns of bipolar elements are generated 
randomly, at first glance they could be considered orthogonal 
(i.e., with zero inner product). This would be true in the 
hypothetical situation of infinite systems (when vectors have an 
infinite number of components). However, since we are always 
dealing with finite vectors, inputs of this sort will be only 
approximately orthogonal, and the inner products will be non-zero. 
This is not orthogonalization by design, and the non-zero 
overlaps mean that the signal gets submerged in the noise when 
p/N > 0.14 [42]. The typical/common notion of orthogonal patterns 
is thus that of sparsely coded non-overlapping patterns (see also [43]), 
and by whatever means it is achieved this can help reduce CI (see 
[40]). The Gram-Schmidt orthogonalization that we use differs as 
it forces the network to actively compute and convert a set of 
vectors into a mutually orthogonal set. In this process the noise 
arising due to the intrinsic overlap amongst patterns, even though 
they are generated randomly, is eliminated and the memory 
capacity increases to p/N = 1 from 0.14. 
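
The difference between patterns that are only approximately orthogonal and patterns that are orthogonal by design can be checked numerically. The sketch below is illustrative only and its function names are hypothetical. It computes the normalized overlap (cosine) between every pair of randomly generated bipolar patterns, whose root-mean-square value is expected to be of order 1/sqrt(N), and then repeats the calculation after Gram-Schmidt orthogonalization, where the overlaps vanish up to rounding error.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors, tol=1e-9):
    """Convert vectors into a mutually orthogonal set spanning the same space."""
    basis = []
    for v in vectors:
        eta = [float(x) for x in v]
        for b in basis:
            coeff = dot(b, eta) / dot(b, b)
            eta = [e - coeff * x for e, x in zip(eta, b)]
        if dot(eta, eta) > tol:
            basis.append(eta)
    return basis


def pair_overlaps(vectors):
    """Cosine of the angle between every pair of vectors."""
    out = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            a, b = vectors[i], vectors[j]
            out.append(dot(a, b) / (dot(a, a) * dot(b, b)) ** 0.5)
    return out


def rms(values):
    return (sum(v * v for v in values) / len(values)) ** 0.5


if __name__ == "__main__":
    n, p = 100, 14
    patterns = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(p)]
    print("rms overlap, random patterns:", rms(pair_overlaps(patterns)))
    print("rms overlap, orthogonalized :", rms(pair_overlaps(gram_schmidt(patterns))))
```

The residual overlaps of the random patterns are the source of the crosstalk noise referred to above, and they do not disappear for any finite N.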

We have examined an artificial system, and the relevance of this 
effect ideally needs to be shown in an experimental system. While 
we, and others, believe that the approach can say something 
relevant to actual systems, this needs to be tested as even in 
theoretical systems effects differ as the degree of realism changes 
(see [18]). That there are sliding thresholds for plasticity is known 
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